# Visual Language Understanding
## Blip Arabic Flickr 8k

**Author:** omarsabri8756 · **License:** MIT · **Task:** Image-to-Text · **Tags:** Transformers, multilingual · **Downloads:** 56 · **Likes:** 1

An Arabic image captioning model based on the BLIP architecture and fine-tuned on the Flickr8k Arabic dataset.

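For context, a minimal captioning sketch with the transformers BLIP classes is shown below. The repo id, image path, and generation settings are hypothetical placeholders, not values taken from this listing.

```python
# Hedged sketch: Arabic image captioning with a BLIP checkpoint via transformers.
# The repo id and image path are hypothetical placeholders.
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

model_id = "omarsabri8756/blip-arabic-flickr-8k"  # hypothetical repo id
processor = BlipProcessor.from_pretrained(model_id)
model = BlipForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
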
## Skywork R1V2 38B

**Author:** Skywork · **License:** MIT · **Task:** Image-to-Text · **Tags:** Transformers · **Downloads:** 1,778 · **Likes:** 105

Skywork-R1V2-38B is positioned as a state-of-the-art open-source multimodal reasoning model, reporting strong results across multiple benchmarks with robust visual reasoning and text comprehension capabilities.

## Emova Qwen 2 5 3b

**Author:** Emova-ollm · **License:** Apache-2.0 · **Task:** Multimodal Fusion · **Tags:** Transformers, multilingual · **Downloads:** 25 · **Likes:** 2

EMOVA is an end-to-end omni-modal large language model that supports visual, auditory, and speech functions, capable of generating text and speech responses with emotional control.

## VL Rethinker 7B Fp16

**Author:** mlx-community · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers, English · **Downloads:** 17 · **Likes:** 0

A multimodal vision-language model converted from Qwen2.5-VL-7B-Instruct, supporting visual question answering tasks.

## Qwen2.5 VL 7B Instruct Gptqmodel Int8

**Author:** wanzhenchn · **License:** MIT · **Task:** Image-to-Text · **Tags:** Transformers, multilingual · **Downloads:** 101 · **Likes:** 0

A vision-language model based on Qwen2.5-VL-7B-Instruct with GPTQ-INT8 quantization.

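A hedged loading sketch with transformers is shown below. It assumes a transformers release with Qwen2.5-VL support and an installed GPTQ backend; the repo id and image path are hypothetical placeholders.

```python
# Hedged sketch: running a GPTQ-INT8 Qwen2.5-VL checkpoint with transformers.
# Requires a transformers version with Qwen2.5-VL support plus a GPTQ backend;
# the repo id and image path are hypothetical placeholders.
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

model_id = "wanzhenchn/Qwen2.5-VL-7B-Instruct-GPTQ-Int8"  # hypothetical repo id
processor = AutoProcessor.from_pretrained(model_id)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image = Image.open("example.jpg").convert("RGB")
inputs = processor(text=[prompt], images=[image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```
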
## Qwen2.5 VL 72B Instruct GGUF

**Author:** Mungert · **License:** Other · **Task:** Image-to-Text · **Tags:** English · **Downloads:** 2,798 · **Likes:** 5

Qwen2.5-VL-72B-Instruct is a 72B-parameter multimodal large model for vision-language tasks, capable of understanding images and generating text about them.

## Rexseek 3B

**Author:** IDEA-Research · **License:** Other · **Task:** Image-to-Text · **Tags:** Transformers · **Downloads:** 186 · **Likes:** 4

An image-to-text model that processes combined image and text inputs and generates corresponding text outputs.

## Qwen2 VL 7B Captioner Relaxed GGUF

**Author:** r3b31 · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** English · **Downloads:** 321 · **Likes:** 1

A GGUF-format conversion of Qwen2-VL-7B-Captioner-Relaxed, optimized for image-to-text tasks and runnable with tools such as llama.cpp and KoboldCpp.

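A hedged smoke-test sketch with llama-cpp-python is shown below. The GGUF file name is a placeholder, and this minimal check only exercises text generation; captioning an image additionally requires the model's mmproj projector file and a multimodal-capable front end such as a recent llama.cpp build or KoboldCpp.

```python
# Hedged sketch: loading the GGUF conversion with llama-cpp-python for a quick
# text-only smoke test. The file path is a placeholder; image input additionally
# needs the matching mmproj projector file and a multimodal front end.
from llama_cpp import Llama

llm = Llama(model_path="Qwen2-VL-7B-Captioner-Relaxed.Q4_K_M.gguf", n_ctx=4096)
result = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Say hello in one short sentence."}],
    max_tokens=32,
)
print(result["choices"][0]["message"]["content"])
```
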
## Deepseer R1 Vision Distill Qwen 1.5B Google Vit Base Patch16 224

**Author:** mehmetkeremturkcan · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers · **Downloads:** 25 · **Likes:** 2

DeepSeer is a vision-language model built on DeepSeek-R1 that supports chain-of-thought reasoning and is trained with dialogue templates for vision models.

## Emu3 Stage1

**Author:** BAAI · **License:** Apache-2.0 · **Task:** Text-to-Image · **Tags:** Transformers · **Downloads:** 1,359 · **Likes:** 26

Emu3 is a multimodal model developed by the Beijing Academy of Artificial Intelligence, trained solely with next-token prediction and supporting image, text, and video processing.

## Llama 3 EvoVLM JP V2

**Author:** SakanaAI · **Task:** Image-to-Text · **Tags:** Transformers, Japanese · **Downloads:** 475 · **Likes:** 20

Llama-3-EvoVLM-JP-v2 is an experimental general-purpose Japanese vision-language model that supports interleaved text and image input, created using an evolutionary model merging approach.

## Cephalo Idefics 2 Vision 10b Alpha

**Author:** lamm-mit · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers, Other · **Downloads:** 137 · **Likes:** 1

Cephalo is a series of vision-focused large language models (V-LLMs) for multimodal materials science, designed to integrate visual and linguistic data to support advanced understanding and interaction in human-computer or multi-agent AI frameworks.

## Open Llava Next Llama3 8b

**Author:** Lin-Chen · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers · **Downloads:** 323 · **Likes:** 26

An open-source chatbot fine-tuned end-to-end on open-source data, intended for research on multimodal models and chatbots.

## Cephalo Idefics 2 Vision 8b Alpha

**Author:** lamm-mit · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers, Other · **Downloads:** 150 · **Likes:** 1

Cephalo is a series of vision-focused large language models (V-LLMs) for multimodal materials science, designed to integrate visual and linguistic data to support advanced understanding and interaction in human-computer or multi-agent AI frameworks.

## Llava Jp 1.3b V1.1

**Author:** toshi456 · **Task:** Image-to-Text · **Tags:** Transformers, Japanese · **Downloads:** 90 · **Likes:** 11

LLaVA-JP is a Japanese-capable multimodal vision-language model that understands input images and generates descriptions of, and dialogue about, them.

## Image Model

**Author:** Mouwiya · **Task:** Image-to-Text · **Tags:** Transformers · **Downloads:** 15 · **Likes:** 0

A Transformers-based image-to-text model; its specific capabilities are not documented in further detail.

## Llava V1.5 13b Dpo Gguf

**Author:** antiven0m · **Task:** Image-to-Text · **Downloads:** 30 · **Likes:** 0

LLaVA-v1.5-13B-DPO is a vision-language model based on the LLaVA framework, trained with Direct Preference Optimization (DPO) and converted to the GGUF quantized format to improve inference efficiency.

## Llava V1.6 34b

**Author:** liuhaotian · **License:** Apache-2.0 · **Task:** Image-to-Text · **Downloads:** 9,033 · **Likes:** 351

LLaVA is an open-source multimodal chatbot fine-tuned on top of a large language model, supporting interaction over both images and text.

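A hedged inference sketch with transformers is shown below. The original liuhaotian weights ship in the LLaVA project's own format; the sketch assumes a transformers-converted variant such as llava-hf/llava-v1.6-34b-hf (an assumption to verify, along with the considerable VRAM requirements) and a recent transformers release.

```python
# Hedged sketch: image-grounded chat with a LLaVA-1.6 checkpoint via transformers.
# The repo id below is an assumed transformers-converted variant, not the
# original liuhaotian checkpoint; the image path is a placeholder.
import torch
from PIL import Image
from transformers import LlavaNextProcessor, LlavaNextForConditionalGeneration

model_id = "llava-hf/llava-v1.6-34b-hf"  # assumed converted checkpoint
processor = LlavaNextProcessor.from_pretrained(model_id)
model = LlavaNextForConditionalGeneration.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="auto"
)

image = Image.open("example.jpg").convert("RGB")
conversation = [
    {"role": "user", "content": [{"type": "image"}, {"type": "text", "text": "What is shown here?"}]},
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=64)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```
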
## Moe LLaVA StableLM 1.6B 4e

**Author:** LanguageBind · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers · **Downloads:** 125 · **Likes:** 8

MoE-LLaVA is a large-scale vision-language model with a mixture-of-experts architecture, achieving efficient multimodal learning through sparsely activated parameters.

## Tiny Llava V1 Hf

**Author:** bczhou · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers, multilingual · **Downloads:** 2,372 · **Likes:** 57

TinyLLaVA is a compact large multimodal model framework focused on vision-language tasks, offering strong performance with a small parameter count.

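A hedged sketch using the transformers image-to-text pipeline is shown below. The repo id, prompt format, and image path are assumptions; check the model card for the exact template the checkpoint expects.

```python
# Hedged sketch: TinyLLaVA-style inference via the transformers image-to-text
# pipeline. Repo id, prompt format, and image path are assumed placeholders.
from transformers import pipeline

pipe = pipeline("image-to-text", model="bczhou/tiny-llava-v1-hf")  # assumed repo id
prompt = "USER: <image>\nDescribe the picture briefly.\nASSISTANT:"
result = pipe("example.jpg", prompt=prompt, generate_kwargs={"max_new_tokens": 60})
print(result[0]["generated_text"])
```
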
## Llava 7B Lightening V1 1

**Author:** mmaaz60 · **Task:** Large Language Model · **Tags:** Transformers · **Downloads:** 1,736 · **Likes:** 10

LLaVA-Lightning-7B is a multimodal model based on LLaMA-7B that achieves efficient vision-language processing through delta-parameter tuning.

## Pix2struct Ocrvqa Base

**Author:** google · **License:** Apache-2.0 · **Task:** Image-to-Text · **Tags:** Transformers, multilingual · **Downloads:** 38 · **Likes:** 1

Pix2Struct is a visual question answering model fine-tuned for OCR-VQA tasks; it can parse textual content in images and answer questions about it.

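A hedged OCR-VQA sketch with the transformers Pix2Struct classes is shown below. The repo id is inferred from the listing's author and model name and should be verified; the image path and question are placeholders.

```python
# Hedged sketch: OCR-VQA inference with Pix2Struct via transformers.
# The repo id is inferred from the listing (verify on the Hub); the image path
# and question are placeholders. For VQA-tuned Pix2Struct checkpoints the
# question is passed as the `text` argument and rendered onto the image.
from PIL import Image
from transformers import Pix2StructProcessor, Pix2StructForConditionalGeneration

model_id = "google/pix2struct-ocrvqa-base"  # inferred repo id
processor = Pix2StructProcessor.from_pretrained(model_id)
model = Pix2StructForConditionalGeneration.from_pretrained(model_id)

image = Image.open("book_cover.jpg").convert("RGB")
question = "Who is the author of this book?"
inputs = processor(images=image, text=question, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=32)
print(processor.decode(output_ids[0], skip_special_tokens=True))
```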